(54) Graphical user interface and method for modifying pronunciations in text-to-speech and
speech recognition systems



(57) A method and user interface which allow users 
to make decisions about how to pronounce words and 
parts of words based on audio cues and common words 
with well known pronunciations. Users input or select 
words for which they want to set or modify pronuncia- 
tions. To set the pronunciation of a given letter or letter 
combination in the word, the user selects the letters and 
is presented with a list of common words whose pronun- 
ciations, or portions thereof, are substantially identical 
to possible pronunciations of the selected letters. The
list of sample, common words is ordered based on fre- 
quency of correlation in common usage, the most com- 
mon being designated as the default sample word, and 
the user is first presented with a subset of the words in 
the list which are most likely to be selected. In addition, 
the present invention allows for storage in the dictionary 
of several different pronunciations for the same word, to 
allow for contextual differences and individual prefer- 
ences. 
Description
Copyright Notice 

[0001] A portion of the disclosure of this patent doc- 
ument contains material which is subject to copyright 
protection. The copyright owner has no objection to the 
facsimile reproduction by anyone of the patent docu- 
ment or the patent disclosure, as it appears in the Pat- 
ent and Trademark Office patent files or records, but 
otherwise reserves all copyright rights whatsoever. 

Background Of The Invention 

[0002] The invention disclosed herein relates gen- 
erally to user interfaces and, in particular, to graphical 
user interfaces for use with text-to-speech and auto- 
mated speech recognition systems. 
[0003] Voice or speech recognition and generation 
technology is gaining in importance as an alternative or 
supplement to other, conventional input and output 
devices. This will be particularly true with continued 
improvements and advances in the underlying software 
methodologies employed and in the hardware compo- 
nents which support the processing and storage 
requirements. As these technologies become more 
generally available to and used by the mass market, 
improvements are needed in the techniques employed 
in initializing and modifying speech recognition and gen- 
eration systems. 

[0004] A few products exist which allow users to 
process files of text to be read aloud by synthesized or 
recorded speech technologies. In addition, there are 
software products used to process spoken language as 
input, identify words and commands, and trigger an 
action or event. Some existing products allow users to 
add words to a dictionary, make modifications to word 
pronunciations in the dictionary, or modify the sounds 
created by a text-to-speech engine. 
[0005] However, users of these products are 
required to understand and employ specialized informa- 
tion about grammar, pronunciation, and linguistic rules 
of each language in which word files are to be created. 
Moreover, in some of these products the means of rep- 
resenting pronunciations requires mastery of a mark-up 
language with unique pronunciation keys not generally 
used in other areas. 

[0006] As a result, these products make text-to- 
speech and automated speech recognition technology 
inflexible and less accessible to the general public. They
require users to become experts in both linguistic rules 
and programming techniques. The inflexibility arises in 
part because these products use general rules of the 
language in question to determine pronunciation with- 
out regard to context, such as geographic context in the 
form of dialects, or individual preferences regarding the 
pronunciation of certain words such as names. 
[0007] Further, the existing products generally provide
less than satisfactory results in pronunciations or
translations of pronunciations. The products do not perform
well with respect to many types of words including
acronyms, proper names, technological terms, trademarks,
or words taken from other languages. Nor do
these products perform particularly well in accounting
for variations in pronunciations of words depending on
their location in a phrase or sentence (e.g., the word
"address" is pronounced differently when used as a
noun as opposed to a verb).

[0008] As a result, there is a need for a user inter- 
face method and system which expresses pronuncia- 
tion rules and options in a simple way so that nonexpert 
users can take fuller advantage of the benefits of text-to-speech
and speech recognition technologies.

Summary Of The Invention 

[0009] It is an object of the present invention to
solve the problems described above with existing text-to-speech
and speech recognition systems.
[0010] It is another object of the present invention to
provide a simple and intuitive user interface for setting 
and modifying pronunciations of words. 

[0011] It is another object of the present invention to
provide for the use in text-to-speech and speech recog- 
nition systems of sounds or letter groups which are not 
typically used in or even violate the rules of a language. 
[0012] These and other objects of the invention are
achieved by a method and user interface which allows
users to make decisions about how to pronounce words 
and parts of words based on audio cues and common 
words with well known pronunciations. 
[0013] Thus, in some embodiments, users input or
select words for which they want to set or modify
pronunciations. To set the pronunciation of a given letter or
letter combination in the word, the user selects the
letter(s) and is presented with a list of common words
whose pronunciations, or portions thereof, are substantially
identical to possible pronunciations of the selected
letters. Preferably the list of sample, common words is
ordered based on frequency of correlation in common
usage, the most common being designated as the
default sample word, and the user is first presented with
a subset of the words in the list which are most likely to
be selected.

[0014] In addition, embodiments of the present
invention allow for storage in the dictionary of several
different pronunciations for the same word, to allow for
contextual differences and individual preferences.
[0015] Further embodiments provide for the storage
of multiple dictionaries for different languages, but allow
users to select pronunciations from various dictionaries
to account for special words, parts of words, and
translations. As a result, users may create and store words
having any sound available to the system, even when
the sound doesn't generally correspond with letters or
letter groups according to the rules of the language.
[0016] In addition to modifying pronunciations of letters
in a word, embodiments of the present invention
allow users to easily break words into syllables or sylla- 
ble-like letter groupings or word subcomponents even 
when the rules of a given language do not provide for 
such groupings as syllables, and to specify which such 
syllables should be accented. As used herein, the word 
syllable refers to such traditional syllables as well as 
other groupings. 

Brief Description Of The Drawings 

[0017] The invention is illustrated in the figures of 
the accompanying drawings which are meant to be 
exemplary and not limiting, in which like references refer 
to like or corresponding parts, and in which: 

Fig. 1 is a block diagram of a system in accordance 
with embodiments of the present invention; 

Fig. 2 is a flow chart showing broadly a process of 
allowing users to modify word pronunciations in 
accordance with the present invention using the 
system of Fig. 1;

Figs. 3A-3B contain a flow chart showing in greater 
detail the process of allowing users to modify word 
pronunciations in accordance with an embodiment 
of the present invention; 

Fig. 4 is a flow chart showing a process of testing 
the pronunciation of a word; and 

Figs. 5-9 contain diagrams of screen displays showing
the graphical user interface of one embodiment
of the present invention. 

Detailed Description Of The Preferred Embodiments

[0018] Embodiments of the present invention are 
now described in detail with reference to the drawings in 
the figures. 

[0019] A text-to-speech ("TTS") and automated
speech recognition ("ASR") system 10 is shown in Fig. 1.
The system 10 contains a computerized apparatus or
system 12 having a microcontroller or microprocessor
14 and one or more memory devices 16. The system 10
further has one or more display devices 18, speakers
20, one or more input devices 22 and a microphone 24.
All such components are conventional and known to 
those of skill in the art and need not be further 
described here. 

[0020] Memory device or devices 16, which may be
incorporated in computer apparatus 12 as shown or
may be remotely located from computer 12 and accessible
over a network or other connection, store several
programs and data files in accordance with the present
invention. A pronunciation selection program 26 allows,
when executed on microcontroller 14, for the generation
of the user interface described herein, the processing of
a user's input and the retrieval of data from databases
28 and 30. Dictionary databases 28 are a number of
databases or data files, one for each language handled
by the system 10, which store character strings and one
or more pronunciations associated therewith.
Pronunciation databases 30 are a number of databases or data
files, one for each of the languages, containing records
each having a character or character group and a
number of sample words associated therewith which
contain characters that are pronounced in a manner
which is substantially identical to the way the characters
may be pronounced. The sample words are selected in
creating the pronunciation databases 30 based on
grammatical and linguistic rules for the language.
Preferably, the sample words for each character or
character group (e.g., diphthong) are ordered generally from
more common usage in pronunciation of the character
to less common.
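
The two kinds of database described above can be pictured with a minimal sketch. It is an illustration only: the class names, the phoneme labels ("CH", "K", "SH") and the sample words chosen here are assumptions made for the example and are not taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class SampleWord:
    word: str   # common word shown to the user, e.g. "church"
    sound: str  # hypothetical label for how the letter group sounds in that word


@dataclass
class PronunciationDatabase:
    """One per language: letter or letter group -> sample words, most common first."""
    language: str
    samples: dict[str, list[SampleWord]] = field(default_factory=dict)

    def lookup(self, letters: str) -> list[SampleWord]:
        # Return a copy so callers cannot reorder the stored list.
        return list(self.samples.get(letters.lower(), []))


@dataclass
class DictionaryDatabase:
    """One per language: character string -> one or more stored pronunciations."""
    language: str
    entries: dict[str, list[str]] = field(default_factory=dict)

    def add(self, string: str, pronunciation: str) -> None:
        pronunciations = self.entries.setdefault(string.lower(), [])
        if pronunciation not in pronunciations:
            pronunciations.append(pronunciation)


# Illustrative data only: three ways "ch" can sound in English.
english = PronunciationDatabase(
    "en",
    {"ch": [SampleWord("church", "CH"),
            SampleWord("school", "K"),
            SampleWord("machine", "SH")]},
)
```

Keeping each sample list ordered from most to least common lets the first element serve as the default sample word without any separate flag.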

[0021] Although shown as two databases, the
dictionary database 28 and pronunciation database 30
may be structured as one data file or in any other format
which facilitates retrieval of the pronunciation data as
described herein and/or which is required to meet the
needs of a given application or usage.
[0022] The system 10 further contains a TTS module
32 and ASR module 34 stored in memory 16. These
modules are conventional and known to those of skill in
the art and include, for example, the ViaVoice® software
program available from IBM. These modules 32
and 34 convert text stored as digital data to audio signals
for output by the speakers 20 and convert audio
signals received through microphone 24 into digital
data. The modules retrieve and utilize pronunciation
data stored in the dictionary databases 28.

EP 1 049 072 A2

[0023] A method for allowing users to easily modify
the pronunciation data stored in dictionary databases
28, as performed by pronunciation selection program
26, is described generally in Fig. 2 and in greater detail
in Figs. 3A-3B. Referring to Fig. 2, in accordance with
the invention a character string, which may be a word,
name, etc., is displayed on display device 18, step 50. A
user uses input devices 22 to select one or more letters
from the string, step 52. As is understood, pronunciation
variations may be linked to individual letters such as
vowels or to groups of letters such as "ou", "ch", "th" or
"gh". The program 26 queries pronunciation database
30 to retrieve the sample words associated with the
selected letter or letter group, step 54. If the letter or
letter group is absent from the pronunciation database 30,
an error message may be sent or sample words for one
of the letters may be retrieved. Some or all of the sample
words are displayed, step 56, and the user selects
one of the words, step 58. The program 26 then generates
pronunciation data for the character string using
the sample word to provide a pronunciation of the
selected letter(s), step 60. The string and pronunciation
data are stored in the dictionary database 28, step 62,
and the string may be audibly output by the output of the
TTS module 32 or used to create a speaker verification
or utterance for ASR module 34.
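
Steps 52 through 62 of Fig. 2 can be sketched roughly as follows. This is a sketch under stated assumptions, not the program 26 itself: the function names, the tuple layout of a sample, the record layout and the `choose` callback standing in for the user's selection are all hypothetical.

```python
def retrieve_samples(pron_db, letters):
    """Step 54: fetch sample words; if the group is absent, fall back to its
    first letter, and raise an error if that is absent too."""
    letters = letters.lower()
    if letters in pron_db:
        return pron_db[letters]
    if letters and letters[0] in pron_db:
        return pron_db[letters[0]]
    raise KeyError(f"no sample words for {letters!r}")


def set_pronunciation(string, start, end, choose, pron_db, dictionary):
    """Steps 52-62 for one selection; `choose` stands in for the user's pick."""
    letters = string[start:end]                    # step 52: selected letters
    samples = retrieve_samples(pron_db, letters)   # step 54: query database 30
    sample_word, sound = choose(samples)           # steps 56-58: display and select
    record = {                                     # step 60: pronunciation data
        "string": string,
        "overrides": {(start, end): sound},
        "sample_word": sample_word,
    }
    # Step 62: store alongside any pronunciations already held for the string.
    dictionary.setdefault(string.lower(), []).append(record)
    return record


# Illustrative data: (sample word, hypothetical sound label) pairs.
pron_db = {"ch": [("church", "CH"), ("school", "K")], "c": [("cat", "K")]}
dictionary = {}
record = set_pronunciation("michael", 2, 4, lambda s: s[1], pron_db, dictionary)
```

Here the user is assumed to have picked the second sample ("school"), so the letters "ch" in "michael" are recorded with the sound label "K".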
[0024] The process implemented by program 26 is
described in more detail in Figs. 3A-3B. An exemplary
embodiment of a user interface used during this process
is illustrated in Figs. 5-9. As shown in Fig. 5, interface
190 displayed on display device 18 contains: an
input box 200 for manual input of or display of selected
characters; a test button 202 which is inactive until a
word is selected; a modify button 204 which is similarly
inactive until a word is selected; a selection list 206
consisting of the choices "sound", "accent" and "syllable"
(or a "grouping"); and a workspace 208.
[0025] As explained above, the system 10 preferably
contains multiple dictionary and pronunciation databases
representing different languages. Referring now
to Fig. 3A, a user selects one of the languages, step 70,
and the program 26 opens the dictionary for the
selected language, step 72. To select a word or other
character string, a user can choose to browse the
selected dictionary, step 74, in which case the user
selects an existing word from the database, step 76.
Otherwise, the user enters a word such as by typing into input
box 200, step 78.

[0026] Next, the user can choose whether to test
the pronunciation of the word, step 80, by selecting test
button 202. The process of testing a word pronunciation
is described below with reference to Fig. 4.
[0027] The user can choose to modify the word's
pronunciation, step 82, by selecting modify button 204.
If not, the user can store the word and current pronunciation
by selecting the "OK" button in dialog 190, step 84.
If the word is not an existing word in the dictionary database
28, step 86, the word and pronunciation data are
stored in the dictionary, step 88. As explained below
with reference to Fig. 4, the pronunciation data for an
unmodified word is generated using default pronunciations
based on the rules of the selected language. If the
word already exists, the new pronunciation data is
stored with the word in the dictionary, step 90, and alternate
pronunciations may be referred to from contextual
circumstances.
[0028] If the user wishes to modify the pronuncia- 
tion, the three choices in selection list 206 are available. 
[0029] The selected word, now appearing in input 
box 200, is broken into individual characters and copied 
into workspace 208. See Fig. 6. Workspace 208 further
shows syllable breaks (the dash in workspace 208) and 
accent marks (the apostrophes in workspace 208) for 
the current pronunciation. 

[0030] If the user selects to modify the syllable
break, step 92, a breakpoint symbol 210 is displayed,
see Fig. 7. The symbol 210 may be moved by the user
to identify a desired syllable breakpoint, step 94. The
program 26 breaks any existing syllable into two syllables
at a selected breakpoint, step 96.
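
Step 96 can be pictured with a short sketch. It assumes, purely for illustration, that the current syllables are held as a list of strings and that the breakpoint is a character offset into the whole word; neither representation is specified in the text.

```python
def split_syllable(syllables, breakpoint):
    """Split whichever syllable contains character offset `breakpoint` in two.

    A breakpoint that already falls on a syllable boundary (or outside the
    word) leaves the list unchanged.
    """
    result = []
    offset = 0  # character offset at which the current syllable starts
    for syllable in syllables:
        end = offset + len(syllable)
        if offset < breakpoint < end:
            cut = breakpoint - offset
            result.extend([syllable[:cut], syllable[cut:]])
        else:
            result.append(syllable)
        offset = end
    return result


first = split_syllable(["michael"], 2)   # break after "mi"
second = split_syllable(first, 5)        # break after "cha"
```

Applied twice to the single group "michael", the sketch yields "mi", "cha" and "el".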
[0031] If the user selects to modify the accent, step
98, an accent type selection icon group 212 is displayed 
in interface 190 (see Fig. 8). The group 212 contains 
three icons: a primary accent (or major stress) icon 
212a, a secondary accent (or minor stress) icon 212b, 
and a no accent (or unstressed) icon 212c. The user 
selects an accent level by clicking one of the icons, step 
100. The user then selects a syllable, step 102, by, for 
example, selecting a box in workspace 208 immediately 
following the syllable. The program 26 identifies the 
selected syllable with the selected accent level, step 
104, and may further adjust the remaining accents in 
accordance with rules of the selected language. For 
example, if the language provides for only one primary
accented syllable, and the user selects a second sylla- 
ble for a primary accent, the program may change the 
first primary accent to a secondary accent, or may 
delete all remaining accents entirely. 
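
The accent adjustment of step 104 might look like the following sketch. It assumes a language that permits a single primary accent per word; the constant names and the `on_conflict` parameter are hypothetical, chosen only to show the two alternatives the text mentions.

```python
PRIMARY, SECONDARY, UNSTRESSED = "primary", "secondary", "unstressed"


def set_accent(accents, index, level, on_conflict="demote"):
    """Assign `level` to the syllable at `index` and return a new accent list.

    If a second primary accent is requested, the existing primary is either
    demoted to secondary ("demote") or every other accent is removed ("clear").
    """
    updated = list(accents)
    others = updated[:index] + updated[index + 1:]
    if level == PRIMARY and PRIMARY in others:
        if on_conflict == "demote":
            updated = [SECONDARY if a == PRIMARY else a for a in updated]
        elif on_conflict == "clear":
            updated = [UNSTRESSED] * len(updated)
        else:
            raise ValueError(f"unknown conflict policy: {on_conflict!r}")
    updated[index] = level
    return updated


demoted = set_accent([PRIMARY, UNSTRESSED, SECONDARY], 1, PRIMARY)
cleared = set_accent([PRIMARY, UNSTRESSED, SECONDARY], 1, PRIMARY, "clear")
```

Returning a new list rather than mutating the input keeps the previous accent pattern available if the user decides to undo the change.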
[0032] Referring now to Fig. 3B, if the user selects
to modify a letter sound in list 206, step 106, the user
selects one or more letters in workspace 208, step 108.
The program 26 retrieves the sample words from the
pronunciation database 30 for the selected language
whose pronunciations, or portions thereof, are associated
or linked with the selected letter(s), step 110. The
words are displayed in word list 214, see Fig. 9. The
sample word which represents the default pronunciation
for the selected letter(s) is highlighted, step 112. See
Fig. 9, in which the sample word "buy" is highlighted in
word list 214 for pronunciation of the selected letter "i".
A user can also listen to the pronunciations of the sample
words. As also shown in Fig. 9, only two or three of
the sample words may be shown in word list 214, with
an option for the user to see and hear additional words.
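
A word list of this kind could be assembled as in the sketch below. It assumes the sample words arrive already ordered from most to least common, so that the first one is the default; the function name, the returned dictionary layout and the sample data are illustrative only.

```python
def build_word_list(samples, visible=3):
    """Split ordered sample words into the highlighted default, the words
    shown at first, and the remainder reachable through a "more" option."""
    if not samples:
        return {"default": None, "visible": [], "more": []}
    return {
        "default": samples[0],          # highlighted entry (step 112)
        "visible": samples[:visible],   # the two or three words shown at first
        "more": samples[visible:],      # shown only on request
    }


# Hypothetical sample words for one selected letter.
word_list = build_word_list(["buy", "bit", "machine", "bird"], visible=3)
```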
[0033] If the user selects one of the sample words,
step 114, the pronunciation data, or portions thereof, for
the selected word is associated with the letter(s)
selected in the selected word contained in workspace
208, step 116. The modified word may then be modified
further or stored, in accordance with the process
described above, or may be tested as described below.
[0034] In accordance with certain aspects of the 
present invention, it is recognized that most languages 
including English contain words taken from other lan- 
guages. Therefore, the user is given (e.g., in word list 
214 after selecting "more") the option of selecting a pronunciation
for the selected letters from another language,
step 118. The user then selects the desired
language, step 120, and the program 26 retrieves sam- 
ple words associated with the selected letter(s) from the 
pronunciation database file 30 for that selected lan- 
guage, step 122. The sample words are then presented 
for the user's selection as explained above. 
[0035] As a result, a simple and flexible process is
achieved for allowing users to modify word pronunciations.
As one example of the ease and flexibility of the
process, the word "michael" selected in Fig. 9 may be
modified from the English pronunciation, "MiK'-el" to a
Hebrew name "Mee-cha'-el" by adding a syllable break
between the "a" and "e" and the "i" and "ch", placing the
primary accent on the new syllable "cha," and selecting
appropriate pronunciations for the "i", "ch" (e.g., from
the Hebrew language dictionary), "a" and "e" based on
common words. No grammatical or linguistic expertise
is required.

[0036] The process of testing a word's pronuncia- 
tion is shown in Fig. 4. If the word already is contained 
in the dictionary database 28, step 140, the stored pro- 
nunciation is retrieved, step 142. If more than one pro- 
nunciation exists for the word, the user may be 
prompted to select one, or a default used. If the word is 
not yet present, then for each letter or letter group, if a 
user has selected a pronunciation using the program 
26, step 144, that pronunciation data is retrieved, step 
146, and otherwise a default pronunciation may be 
selected, step 148. When all letters have been
reviewed, step 150, the program 26 generates a pro- 
nunciation for the word using the retrieved letter pronun- 
ciations, step 152. Finally the TTS module outputs an 
audible representation of the retrieved or generated 
word pronunciation, step 154. 
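
The flow of Fig. 4 up to step 152 can be sketched as below. The representation is an assumption: letter groups are given as a list, user selections are keyed by position in that list, and sounds are hypothetical labels joined by spaces. Where several pronunciations are stored, the sketch simply uses the first as the default rather than prompting.

```python
def pronunciation_for_test(word, dictionary, groups, user_choices, defaults):
    """Return the pronunciation that would be handed to the TTS module."""
    stored = dictionary.get(word.lower())
    if stored:                            # steps 140-142: word already stored
        return stored[0]
    sounds = []
    for index, group in enumerate(groups):
        if index in user_choices:         # steps 144-146: user-selected sound
            sounds.append(user_choices[index])
        else:                             # step 148: default for the language
            sounds.append(defaults.get(group, group.upper()))
    return " ".join(sounds)               # step 152: assemble the word


# Illustrative data: hypothetical default sound labels for each letter group.
groups = ["m", "i", "ch", "a", "e", "l"]
defaults = {"m": "M", "i": "AY", "ch": "K", "a": "AH", "e": "EH", "l": "L"}
generated = pronunciation_for_test("michael", {}, groups, {1: "IY"}, defaults)
```

Keying user selections by position rather than by letter keeps two occurrences of the same letter in one word independent of each other.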

[0037] Because the system described herein allows 
for multiple pronunciations for a single word, the TTS 
module must identify which pronunciation is intended for 
the word. The TTS module can identify the pronuncia- 
tion based on the context in which the word is used. For 
example, the pronunciations may be associated with 
objects such as users on a network, such that a mes- 
sage intended for a specific user would result in a cor- 
rect selection of pronunciations. As another example, 
the TTS module may identify a word usage as noun vs. 
verb, and select the appropriate pronunciation accord- 
ingly. 
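
One way to picture this selection is the sketch below, which assumes each stored pronunciation record may carry an associated object (here a hypothetical "owner" field) and a part of speech (a hypothetical "pos" field). The respellings are informal illustrations, not data from the disclosure.

```python
def select_pronunciation(records, recipient=None, part_of_speech=None):
    """Pick one of several stored pronunciations from contextual cues.

    An associated object (such as the message recipient) is tried first,
    then the part of speech; otherwise the first record is the default.
    """
    if not records:
        raise ValueError("no pronunciation records to choose from")
    if recipient is not None:
        for record in records:
            if record.get("owner") == recipient:
                return record["pronunciation"]
    if part_of_speech is not None:
        for record in records:
            if record.get("pos") == part_of_speech:
                return record["pronunciation"]
    return records[0]["pronunciation"]


address = [
    {"pronunciation": "AD-dress", "pos": "noun"},
    {"pronunciation": "uh-DRESS", "pos": "verb"},
]
spoken = select_pronunciation(address, part_of_speech="verb")
```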

Claims 

1. A method implemented on a computer for allowing 
a user to set a pronunciation of a string of charac- 
ters, the method comprising: 

allowing the user to select one or more charac- 
ters in the string; 

retrieving from a database accessible by the 
computer a plurality of samples of words or 
parts of words representing possible pronunci- 
ations of the selected one or more characters 
and displaying the retrieved samples; 

allowing the user to select one of the displayed 
samples; and 

storing a first pronunciation record comprising
the string of characters with the selected one or
more characters being assigned the pronunciation
associated with the sample selected by the
user.

2. The method of claim 1, comprising generating a
pronunciation of the character string using the
pronunciation represented by the sample selected by
the user as the pronunciation for the selected one
or more characters, and audibly outputting the
generated pronunciation.

3. The method of claim 2, comprising allowing the 
user to select another of the displayed samples 
after audibly outputting the generated pronuncia- 
tion. 

4. The method of claim 1, comprising allowing the
user to select a second of the displayed samples
and storing a second pronunciation record comprising
the string of characters with the selected one or
more characters being assigned the pronunciation
represented by the second sample selected by the
user.

5. The method of claim 4, comprising, during a text-to-speech
process of generating audible output of a
text file containing the string of characters, selecting
one of the first and second pronunciation
records.

6. The method of claim 5, comprising associating the
first and second pronunciation files with first and
second objects, respectively, and selecting one of
the first and second objects, and wherein the step
of selecting one of the first and second pronunciation
records comprises selecting the pronunciation
record associated with the selected object.

7. The method of claim 4, comprising, during a
speech recognition process, recognizing a pronunciation
of the string of characters by a user and
selecting one of the first and second pronunciation
records which most closely matches the recognized
pronunciation.

8. The method of claim 7, comprising associating the
first and second pronunciation files with first and 
second objects, respectively, and selecting one of 
the first and second objects which is associated 
with the selected pronunciation record. 

9. The method of claim 1, comprising allowing the
user to identify a part of the character string as a
separate syllable, and wherein the step of storing
the first pronunciation record comprises storing
data representing the identified separate syllable.

10. The method of claim 1, comprising allowing the
user to identify a part of the character string to
associate with an accent, and wherein the step of
storing the first pronunciation record comprises
storing data representing the identified accent.

11. The method of claim 1, comprising receiving the
character string as input by the user. 

12. The method of claim 1, comprising allowing the 
user to select the character string from a dictionary 
database accessible to the computer.

13. The method of claim 1, comprising allowing the
user to select a preferred language and wherein the
step of retrieving the samples representing possible
pronunciations of the selected one or more characters
comprises selecting a database for the preferred
language from a plurality of language
databases and retrieving the samples from the
selected database.

14. The method of claim 1, comprising allowing the 
user to select a second language for the selected 
one or more characters and retrieving additional 
word samples from a second database corresponding
to the selected second language.

15. A computer program product directly loadable into
the internal memory of a digital computer comprising
code for performing a method as claimed in any
of claims 1 to 14 when said product is run on a computer.

16. A computer program product stored on a computer
usable medium comprising computer readable program
means for causing a computer to perform a
method as claimed in any of claims 1 to 14.

17. A graphical user interface system for allowing a
user to modify a pronunciation of a string of characters,
the system comprising:

a dictionary database stored on a memory 
device comprising a plurality of first character 
strings and associated pronunciation records; 

a pronunciation database stored on a memory
device comprising a plurality of second character
strings each comprising one or more characters
and each associated with a plurality of
words, each word having one or more characters
which are pronounced in the word in substantially
identical fashion to one manner in
which the associated second character string
may be pronounced;

an input/output system for allowing a user to
select one of the first character strings from the
dictionary database, to select one or more
characters from the selected string, and to
select one of the words in the pronunciation
database; and

a programmable controller for generating a pro- 
nunciation record comprising the selected first 
character string with the selected one or more 
characters being assigned the pronunciation 
associated with the word sample selected by 
the user. 
[Drawing sheets: FIG. 2, FIG. 3A, FIG. 3B, FIG. 4, FIG. 5]